DiskReduce: Replication as a Prelude to Erasure Coding in Data-Intensive Scalable Computing
Authors
Abstract
The first generation of Data-Intensive Scalable Computing file systems, such as the Google File System and the Hadoop Distributed File System, employed n-way replication (n ≥ 3) for high data reliability, and therefore delivered to users only about 1/n of the raw disks' total storage capacity. This paper presents DiskReduce, a framework that integrates RAID into these replicated storage systems to significantly reduce storage capacity overhead, for example from 200% to 25% when triplicated data is dynamically replaced with RAID sets (e.g., an 8 + 2 RAID 6 encoding). Based on traces collected from Yahoo!, Facebook, and OpenCloud clusters, we analyze (1) the capacity effectiveness of simple and not-so-simple strategies for grouping data blocks into RAID sets; (2) the implications of reducing the number of data copies for read performance, and how to overcome the resulting degradation; and (3) heuristics to mitigate the “small write” penalty. Finally, we introduce an implementation of our framework that has been built and contributed to the Apache Hadoop project.
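To make the headline numbers concrete, the sketch below (an illustration, not code from the paper) computes the redundancy overhead of n-way replication and of a k + m RAID set; with n = 3 versus an 8 + 2 RAID 6 encoding it reproduces the 200% and 25% figures quoted above.

```python
# Illustrative overhead arithmetic, not code from the DiskReduce paper.
def replication_overhead(n: int) -> float:
    # n-way replication stores n copies: 1 usable plus (n - 1) redundant.
    return float(n - 1)

def raid_overhead(k: int, m: int) -> float:
    # A k + m RAID set stores m parity blocks per k data blocks.
    return m / k

print(f"3-way replication: {replication_overhead(3):.0%} overhead")  # 200%
print(f"8 + 2 RAID 6:      {raid_overhead(8, 2):.0%} overhead")      # 25%
```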
Similar Resources
DiskReduce: RAID for Data-Intensive Scalable Computing (CMU-PDL-09-112)
Data-intensive file systems, developed for Internet services and popular in cloud computing, provide high reliability and availability by replicating data, typically keeping three copies of everything. Alternatively, high-performance computing, which operates at comparable scale, and smaller-scale enterprise storage systems get similar tolerance for multiple failures from lower-overhead erasure encoding, or RAI...
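The capacity argument in this report can be made concrete with a small sketch of my own (assuming an 8 + 2 layout): both triplication and 8 + 2 RAID 6 survive any two simultaneous failures, yet they leave very different fractions of the raw capacity usable.

```python
# Two schemes with the same two-failure tolerance but different usable capacity.
schemes = {
    "3-way replication": (1, 2),  # 1 data block plus 2 extra copies
    "8 + 2 RAID 6":      (8, 2),  # 8 data blocks plus 2 parity blocks
}
for name, (data_blocks, redundant_blocks) in schemes.items():
    usable = data_blocks / (data_blocks + redundant_blocks)
    print(f"{name}: survives 2 failures, {usable:.0%} of raw capacity usable")
```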
Data Replication-Based Scheduling in Cloud Computing Environment
High-performance computing and vast storage are two key factors required for executing data-intensive applications. In comparison with traditional distributed systems such as data grids, cloud computing provides these factors in a more affordable, scalable, and elastic platform. Furthermore, accessing data files is critical to the performance of such applications. Sometimes accessing data becomes...
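The truncated abstract points toward data-locality-aware scheduling. As a purely hypothetical sketch of that general idea (none of these names or policies come from the paper), a scheduler might prefer nodes that already hold a replica of a task's input block, falling back to the least-loaded node:

```python
# Hypothetical locality-aware placement; function and data names are invented.
def pick_node(block: str, replicas: dict, load: dict) -> str:
    local = [n for n in replicas.get(block, []) if n in load]
    candidates = local or list(load)       # fall back to any node if no local replica
    return min(candidates, key=lambda n: load[n])

replicas = {"blockA": ["node1", "node3"]}
load = {"node1": 4, "node2": 0, "node3": 1}
print(pick_node("blockA", replicas, load))  # node3: local replica with lightest load
```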
DiskReduce: RAIDing the Cloud
Data-Intensive Scalable Computing (DISC) file systems such as HDFS employ replication for reliability, typically delivering to users only about a third of the storage capacity of the raw disks. In this project, we investigate DiskReduce, a framework for integrating RAID into these replicated storage systems to lower storage capacity overhead, for example, from 200% to 25% when triplicated da...
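One way a DiskReduce-style system can limit the read-performance cost of dropping replicas is to encode lazily, only after a block has aged past its hottest period. The sketch below is a hypothetical age-based trigger for that background encoding; the threshold is an assumption, not a number from the paper.

```python
# Hypothetical age-based trigger for background encoding; threshold is assumed.
import time

ENCODE_AFTER_SECS = 24 * 3600  # assumed cool-down period, not from the paper

def ready_to_encode(created_at: float, now: float) -> bool:
    # Keep all three replicas while the block is young (and likely hot).
    return now - created_at >= ENCODE_AFTER_SECS

now = time.time()
blocks = {"blk_001": now - 2 * 86400, "blk_002": now - 3600}
for blk, created_at in blocks.items():
    plan = "encode into RAID set" if ready_to_encode(created_at, now) else "keep replicas"
    print(f"{blk}: {plan}")
```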
Distributed File System Based on Erasure Coding for I/O-Intensive Applications
Distributed storage systems take advantage of network, storage, and computational resources to provide a scalable infrastructure. But in such large systems, failures are frequent and expected. Data replication is the common technique for providing fault tolerance, but it suffers from high storage consumption. Erasure coding is an alternative that offers the same data protection but reduces ...
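As a toy illustration of the erasure-coding alternative this abstract describes, here is a single-parity sketch of mine (RAID-5-style, far simpler than the double-fault-tolerant codes real systems use): one XOR parity block lets any one lost block be rebuilt from the survivors.

```python
# Toy single-parity erasure code: XOR of the survivors rebuilds one lost block.
from functools import reduce

def xor_blocks(blocks):
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data = [b"blk0", b"blk1", b"blk2", b"blk3"]   # four equal-sized data blocks
parity = xor_blocks(data)                     # one parity block: 25% overhead here

lost = 2                                      # pretend data[2] is gone
survivors = [b for i, b in enumerate(data) if i != lost] + [parity]
assert xor_blocks(survivors) == data[lost]    # recovered from the survivors
```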
IStore: Towards High Efficiency, Performance, and Reliability in Distributed Data Storage with Information Dispersal Algorithms
Reliability is one of the major challenges for high-performance computing and cloud computing. Data replication is a commonly used mechanism to achieve high reliability. Unfortunately, it has low storage efficiency, among other shortcomings. As an alternative to data replication, information dispersal algorithms offer higher storage efficiency, but at the cost of being too computing-intensive ...
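To show the "any m of n fragments suffice" property behind information dispersal, here is a minimal sketch using Lagrange interpolation over the prime field GF(257). This is an illustration only; production dispersal schemes use optimized Galois-field arithmetic, and none of these names come from the paper.

```python
# Minimal "any m of n" dispersal demo over the prime field GF(257).
P = 257

def lagrange_at(x, points):
    # Evaluate the unique degree < m polynomial through `points` at x, mod P.
    total = 0
    for i, (xi, yi) in enumerate(points):
        num = den = 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P  # Fermat inverse
    return total

def disperse(data, n):
    pts = list(enumerate(data, start=1))   # systematic: fragments 1..m hold the data
    return [(x, lagrange_at(x, pts)) for x in range(1, n + 1)]

def recover(fragments, m):
    pts = fragments[:m]                    # any m fragments work
    return [lagrange_at(x, pts) for x in range(1, m + 1)]

data = [10, 20, 30, 40, 50, 60, 70, 80]   # m = 8 symbols (values < 257)
frags = disperse(data, 10)                # n = 10 fragments: 1.25x raw storage
assert recover(frags[2:], 8) == data      # survives losing any 2 fragments
```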